Conditional Gradient Algorithms for Rank-One Matrix Approximations 

with a Sparsity Constraint 

Ronny Lus^j and Marc TebouUe 

School of Mathematical Sciences 

Tel-Aviv University, Ramat-Aviv 69978, Israel 

email: ronnyluss@gmail.com, teboulle@post.tau.ac.il 

^1^ ■ June 29, 2012 

O 

Abstract 

^ . The sparsity constrained rank-one matrix approximation problem is a difficult mathematical opti- 

mization problem which arises in a wide array of useful applications in engineering, machine learning 
C^ , and statistics, and the design of algorithms for this problem has attracted intensive research activities. 

We introduce an algorithmic framework, called ConGradU, that unifies a variety of seemingly differ- 
ent algorithms that have been derived from disparate approaches, and allows for deriving new schemes. 
^^ ■ Building on the old and well-known conditional gradient algorithm, ConGradU is a simplified version 

^^ , with unit step size and yields a generic algorithm which either is given by an analytic formula or re- 

• ■ quires a very low computational complexity. Mathematical properties are systematically developed and 

'•pi , numerical experiments are given. 
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The problem of interest here is the sparsity constrained rank-one matrix approximation given by 

t^ I maxja; Ax : \\x\\2 = 1, ||a;||o < k,x £ R"}, (1) 

where ^4 S S" is a given real symmetric matrix, and 1 < /c < n is a parameter controlling the sparsity 
of X which is defined by counting the number of nonzero entries of x and denoted using the Iq notation: 
k> . ||a;||o = |{i : Xj 7^ 0}|. This problem is also commonly known as the sparse Principal Component Analysis 

?H I (PCA) problem, or as we refer to it, /q -constrained PCA. Without the Iq constraint, the problem reduces to 

finding the first principal eigenvector and the corresponding maximal eigenvalue of the matrix A, i.e., solves 

max{x Ax : \\x\\2 = 1, x G R"}, 

which is the PCA problem. 

Suppose A = B^B where B is anm x 7i mean-centered data matrix with m samples and n variables, 
and denote by v be the principal eigenvector of A, i.e., v solves the above PCA problem. Then Bv projects 
the data B to one dimension that maximizes the variance of the projected data. In general, PCA can be 
used to reduce B to I < n dimensions via the projection B{vi, . . . ,vi) with Vi as the i^^ eigenvector of A. 
In addition to dimensionality reduction, PCA can be used for visualization, clustering, and other tasks for 
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data analysis. Such tasks occur in vaiious fields, e.g., genetics lUlllZl. face recognition |[T6l[38l . and signal 
processing IJlQlfTTl . 

In PCA, the eigenvector is typically dense, i.e., each component of the eigenvector is nonzero, and 
hence the projected variables ai^e linear functions of all original n variables. In spai^se PCA, we restrict the 
number of variables used in this linear projection, thereby making it easier to interpret the projections. The 
additional /q constraint however makes problem (O a difficult and mathematically interesting problem which 
arises in many scientific and engineering applications where very large-scale data sets must be analyzed and 
interpreted. 

Not surprisingly, the search and development of adequate algorithms for solving problem ^ have thus 
received much attention in the past decade, and this will be discussed below. But first, we want to make 
clear the main purpose of this paper. We have three main goals: 

• To develop a novel and very simple approach to the /q -constrained PCA problem ([T]) as formulated 
and without any modifications, i.e., no relaxations or penalization, which is amenable to dimensions 
in the hundreds of thousands or even millions. 

• To present a "Father Algorithm", which we call ConGradU, based on the well-known first-order condi- 
tional gradient scheme, which is very simple, allows for a rigorous convergence analysis, and provides 
a family of cheap algorithms well-suited to solving various formulations of spai^se PCA. 

• To provide a closure and unification to many seemingly disparate approaches recently proposed, and 
which will be shown to be particular reaUzations of ConGradU. 

Most current approaches to sparse PCA can be categorized as solving one of several modified optimiza- 
tion problems based on penahzation, relaxations, or both, and include: 

(a) /i -constrained PCA: inax{x^Ax : \\x\\2 < 1, ||2;||i < ^/k,x G R"}, 

(b) /q -penalized PCA: maxlx'^'^x — s||2;||o : ||a;||2 < l,a^ £ R*^}) 

(c) Zi -penalized PCA: ina'x{x^Ax — s||a;||i : ||x||2 < l,a; G R*^}, 

(d) Approximate /o-penalized PCA: maxjx^Arc — sgp{x) : \\x\\2 < l,x £ R""}, 
where gp{x) ~ ||2;||o and p controls the approximation, 

(e) Convex relaxations. 

These models will be presented in detail in the forthcoming section, except for approach (e) which is not 
thoroughly discussed in this paper (but see ^l.ll below). 

In a nutshell, the original /q -constrained PCA problem ([T]) and the corresponding modified problems 
(a)-(d) above can be written or transformed in such a way that they reduce to maximizing a convex function 
over some compact set C C R*^: 

(P) ma^{F{x) : x e C}. 

When the problem of maximizing a linear function over the compact set C can be efficiently computed, 
or even better, can be obtained analytically, a very simple and natural iterative scheme to consider for solving 
(P) is the so-called conditional gradient algorithm |[2ll.[T2ll4 



"The conditional gradient scheme is also known as the Frank- Wolfe algorithm |15| . The latter was devised to minimize quadratic 
convex functions over a bounded polyhedron, while the former was extended mainly to solve convex minimization problems, see 
Section[3]for more precise details and relevant references. 



To achieve the goals alluded to above, in this paper, all developed algorithms for tackling problem (P) 
will be based on the conditional gradient scheme with a unit step size called ConGradU. At this juncture, it 
is important to notice that Mangasarian [24] seems to have been the first work suggesting and analyzing the 
conditional gradient algorithm with a unit step size for maximizing a convex function over a polyhedron, in 
the context of machine learning problems. 

A common and interesting appeal of the resulting algorithms is that they take the shape of a closed-form 
iterative scheme, i.e., they can be written as 

\\S{Ax^W' ' '■■■ 

where 5 : R" — )• R" is a simple operator that can be either written in explicit form or efficiently computeqj. 
All problems addressed here are difficult nonconvex optimization problems and we make no claims with 
respect to global optimality; after all, these are difficult problems so obtaining cheap solutions must have 
some cost which here is a certificate of global optimality. Moreover, an important driving point is that there 
is no reason to discount stationary solutions of nonconvex problems versus globally optimal solutions to 
convex relaxations. For neither solution do we have a measure of the gap to the optimal solution of problem 
([T]). We can empirically demonstrate similar solution quality, while the nonconvex methods are orders of 
magnitudes cheaper to compute and can be applied to data sets much larger than can be done with any 
known convex relaxation. This is in contrast to the well-known sparse recovery problem, for which there is 
an equivalence between the difficult combinatorial problem and the linear program relaxation when the data 
matrix satisfies certain conditions ||8l. To put in perspective the development and results of this paper, we 
first discuss some of the relevant literature that has motivated this work. 

1.1 Literature 

The literature on sparse PCA can be divided according to the different modifications discussed above. Here, 
we briefly survey these approaches and detail them further in Section [5] With respect to the /q -constrained 
PCA problem ([T]), thresholding [7] is perhaps the simplest approach, however is known to produce poor 
results. Spai^se low-rank approximations (SLRA) of [i39ll looks for a more general (uv^ rather than xx^) 
sparse approximation by taking an approximation error level as input and determining the sparsity level k 
that is required to satisfy the desired error. Greedy methods |[25l are also computationally expensive due to 
a maximum eigenvalue computation at each iteration, however an approximate greedy approach ifTOl offers 
a much cheaper way to derive an entire path of solutions (for each k = . . .n) which often suffice. These 
approaches are heuristics. The only globally optimal approach to this formulation is an exact search method 
ll25l that is applicable only to extremely small problems. 

The /i -constrained PCA problem is a relaxation, or more precisely an upper bound, to problem ([I]). This 
problem was first considered in [jlSil and called SCoTLASS (simplified component technique least absolute 
shrinkage and selection). It was motivated by the LASSO (least absolute selection and shrinkage operator) 
approach used in statistics |[33l for inducing sparsity in regression. In |[34l . a true penalty function is used 
to handle the /i constraint and the resulting problem is solved as a system of differential equations (which 
requires a smooth approximation for the li constraint). While this approach was tested solely on a small 13- 
dimensional data set, we mention it as the only true penalty function approach in the sparse PCA literature. 



'in fact, when S is the identity operator, the scheme is nothing else but the well known power method to compute the first 
principal eigenvector of the matrix A, see, e.g., 1321 . 



More recently, a computationally cheap approach to 11 -constrained PC A that can solve large-scale problems 
was given by |[36l . We will show that this scheme is an application of the conditional gradient algorithm. 

As already explained in the introduction, we are not interested in formulations that require expensive 
computations. Convex relaxations are either semidefinite-based or optimize over symmetric n xn matrices, 
and are, indeed, much more computationally expensive than what we would like to consider here. Never- 
theless, it is important to briefly recall some of these works. In HI], d'Aspremont et al. introduced a convex 
relaxation to /i -constrained PCA using semidefinite programming, however this formulation can only be 
solved on very small dimensions (< 100). This was the first approach with convex optimization to any 
sparse PCA modification and motivated a convex relaxation for ^i -penalized PCA for which better algo- 
rithms were given. Another approach for /i -constrained PCA in |[22l solves a different convex relaxation 
over S" based on a variational represention of the li ball ( this relaxation turned out to be the dual of the 
semidefinite relaxation in ifTTI ). This was the first convex relaxation for /i -constrained PCA amenable to a 
medium number of dimensions (1000-2000). 

We next turn to penalized sparse PCA where the sparsity-inducing term appears in the objective func- 
tion. Several approaches are known for ^o-penalized PCA. The heuristic given in |[39l is extended in BOl to 
this penalized version of PCA. In ifTOl . the problem is reformulated as an equivalent convex maximization 
problem to which a semidefinite relaxation is applied, however, as mentioned above, solving such problems 
is too computationally expensive. More recently, jTOll derived the same convex maximization problem (in 
a different manner) and proposed a first-order gradient-based algorithm which is identical to ConGradU. 
Another convex maximization representation was very recently presented in OTl . whereby a specific pa- 
rameterized concave approximation is used to replace the Iq term, and the resulting problem was solved by 
an iterative scheme called the minorization-maximization technique, which is, in fact, a specific instance of 
ConGradU. 

The Zi-penalized PCA problem can be solved by a convex relaxation as shown in ifTTl . but is amenable 
to only a medium number of dimensions. In BH . Zhou et al. considered a reformulation of PCA as a two 
variable regression problem to which an li penalty is added for one of the variables. While their approach is 
amenable to larger problems, it is not exactly Zi -penalized PCA and is still more computationally expensive 
(cf. Section [231) than other approaches we will discuss. Recently, [|191 reformulated /i -penalized PCA as 
an equivalent convex maximization problem (as with their ^o-penalized approach) that is also solved by the 
conditional gradient algorithm. 

As discussed above, the conditional gradient algorithm has previously been applied to spai^se PCA in 
various forms, so we now put the contributions of the above works into perspective. The works of |[36l l29l 
detail their iterative schemes (without any general algorithm) specifically for sparse PCA. OTl details a 
general algorithm, similar to ConGradU, but meant for maximizing the difference of convex functions over 
a convex set. While maximizing a convex function is a subset of this algorithm, ConGradU, as detailed 
in Section [3l allows for nonconvex sets. jTOll is the only previously known work that details a general 
algorithm for maximizing a convex function over a compact (possibly nonconvex) set. Indeed, the first- 
order algorithm proposed in |[T9l (labeled Algorithm 1 therein) is identical to what we term the ConGradU 
algorithm, however it is not recognized as the conditional gradient algorithm. As noticed in ifTOll . both 
the Iq and /i -penalized PCA algorithms they have proposed were earlier stated in [i29il . subject to slight 
modifications, who look for a general rank-one approximation (i.e., uv'^ rather than xx'^); however, no 
convergence results were stated in |[29ll . 



1.2 Outline 

We provide computationally simple approaches to both constrained and penalized versions of sparse PCA. 
It is important to recognize that all algorithms here are schemes for nonconvex problems; we pay the price 
of no global optimality criterion and gain in amenability to problem sizes that convex relaxations cannot 
handle. 

In Section [21 we define the problems of interest and some of their properties. Section [3] recalls some 
basic optimality results for maximizing a convex function over a compact set. We then detail ConGradU, a 
specific conditional gradient scheme with unit step size, and establish its convergence properties. Section |4] 
provides a mathematical toolbox proving a series of propositions that are used to develop the known cheap 
algorithms mentioned above, as well as for deriving new schemes. These propositions are simple and easy 
to prove so we believe it benefits the reader to go through these tools first. 

Section[5]then details the algorithms for all versions of sparse PCA. We start with a simple algorithm for 
the true /q -constrained PCA problem ([T]). To the author's knowledge, this is the first available scheme that 
directly approaches this problem, is amenable to large-scale problems and proven to converge to a stationary 
point of problem ([Hi. While the Iq constraint is a difficult nonsmooth and nonconvex constraint, we need 
not look for ways around this constraint, and rather we approach the given problem as is. The basis for 
our approach is the simple, yet surprising, result (cf. Section [4]) that while maximizing a quadratic function 
over the I2 unit ball with an Iq constraint is a difficult problem, maximizing a linear function over the same 
nonconvex set is simple and can be solved in 0{n) time. An important aspect of the new ^o-constrained 
PCA algorithm is that no parameters need be tuned in order to obtain a stationary point that has the exact 
desired sparsit>|j. The second main part of Section [5]focuses on all aforementioned iterative schemes which 
have been proposed in the literature. Building on the results of Section [H we show that all these schemes 
can directly be obtained as a particular realization of ConGradU, or of some variant of it, thus providing a 
unifying framework to various seemingly different algorithmic approaches. 

Section [6] provides experimental results and demonstrates the efficiency of many of the methods we 
have reviewed on large-scale problems. We show that they all give comparable solutions, i.e., very similar 
A:-sparse solutions, with the advantage of /q -constrained PCA being that the A;-sparse solution is directly 
obtained at a lower computational cost. Section [7] ends with concluding remarks and briefly shows how to 
use the same tools to develop simple algorithms for related sparsity constrained problems. 

1.3 Notation 

We write S" (S" ,S"^) to denote the set of symmetric (positive-semidefinite, positive-definite) matrices of 
size 71 and R" (R",R!J:^) to denote the set of (nonnegative, strictly positive) real vectors of size ?i. The 

vector e is the n-vector of ones. Given a vector x S R", ||3;||2 = {J2i^l)^ defines the I2 norm, J|x||o 
defines the cardinality of x, i.e., the number of nonzero entries of x and usually called here the Iq norrqj, and 
||x||oo = niax(|xi|, . . . , |x„|). Given a matrix X G S", H-'^Hoo = maxj^ X^j and \\X\\f = (Ylij-^ij)^- 
For a vector x G R", |3;| denotes the vector with z*'* entry \xi\, sgn(x) denotes the vector with i^^ entry 
-1,0,1 if Xi < 0,Xi = 0,Xi > 0, and x+ denotes the vector with i*'* entry max(rEj,0). For a vector 
X S R", diag(x) denotes the diagonal matrix with x on its diagonal. For any nonzero integer n, denote 
the set {1, . . . ,n} as [n]. Let /„ denote the identity matrix in dimension n. Let C^ denote the space of 



""ah other algorithm based on modifications can be used to obtain a desired sparsity as well, however parameters must be tuned 
accordingly. 

''We note an abuse of terminology because ||a;||o is not a true norm since it is not positively homogenous. 



once continuously differentiable functions on R". Given an optimization problem (P), we use argmax(i-*) 
to denote its optimal solutions set. 

2 Problem Formulations 

This section describes the relationship between /q -constrained PCA and the various modified sparse PCA 
problems that are discussed throughout the paper, as well as certain properties. 

2.1 The Original Optimization Model 

We start with some useful and elementary properties of the /q -constrained PCA problem. 

The /o-constrained PCA Problem 

Given a symmetric matrix A G 5" and sparsity level k € [l,n\, the main problem of interest is to solve the 
^o-constrained PCA problem (i.e., the sparse eigenvalue problem): 

{E) max{x'^Ax : \\x\\2 = 1, ||x||o <k,x£ R"}. (2) 

In most applications, A is the covariance matrix of some data matrix B G R™x" such that A = B^B and 
hence is positive semidefinite. The latter fact will be exploited in reformulations of the problem described 
below. In fact, as is very well-known, since problem (E) is constrained by the unit sphere, without loss of 
generality (see e.g., ifTOlfTTlfTQl ) we can always assume that A is positive semidefinite since we cleai^ly have 

maxjx Aa-x : \\x\\2 = 1, ||2;||o < A;) = maxjx Ax : llxlb = 1, l|x||o < k\ + a, 
xgR" xgR" 

with a > such that A^^ := A + ain G 5"++) i-C-. with respect to optimal objective values we are solving 
the same problem. 

The next result furthermore states that we can relax the sphere constraint to its convex counterpart, the 
unit ball, namely we consider the problem 

(Ea) max{x A^jX : \\x\\2 < 1, ||x||o < /c,x G R"} 

and also show that problems (E) and (£^cr) are equivalent and admit the same set of optimal solutions. The 
simple proof is omitted. 

Lemma 1 Fix any a > such that A^ G S^+- Then, 

(a) uiSi.yi{x'^ AfjX : \\x\\2 < 1, ||x||o < k} = m.ax{x^Ax : \\x\\2 = 1, ||x||o < k} + a. 

(b) argmax(£^) = argmax(£'o-)- 

Needless to say that both problems (E) and (Ea-) remain hard nonconvex problems as they consist of 
maximizing a convex (strongly convex) function over a compact set, and clearly the possibility of using 
the convex relaxation {x : \\x\\2 < 1} instead of the nonconvex unit sphere constraint does not change the 
situation. However, it is useful to know that either one of these constraints can be used when tackling the 
problem, and this will be done throughout the rest of the paper without any further mentioning. 

Throughout the paper, (E) will be referred to as the /o-constrained PCA problem, and A always denotes 
a symmetric matrix while A^ will denote a symmetric positive definite matrix. We now consider the other 
formulations that will be analyzed. 



2.2 Modified Optimization Models 

As mentioned in the introduction, most approaches to sparse PCA solve one of the following modified 
problems. The first variation is a relaxation based on the relation 



Iklli < viNhlklb VxGR" (3) 

which follows from the Cauchy-Schwarz inequality. The hard Iq constraint in problem ^ is replaced by an 
/i constraint, resulting in 

The Zi -constrained PCA Problem 

max{x'^Ax : ||x||2 < 1, ||x||i < Vk,x £ R"}, (4) 

which thanks to inequality Q is an upper bound to the original /q -constrained PCA problem. Two other 
variations are based on penalizations of the /q and /i constraints of the above formulations. Note that the 
penalized terminology is different from the usual one used in optimization, and here is used to mention that 
it rather optimizes a tradeoff between how good and how sparse the approximation is. We first penalize the 
Iq constraint in problem ^, resulting in 

The /q -penalized PCA Problem 

maxjx ^x — s||x||o : ||2;||2 < 1,2; G R"}, (5) 

where s > is a parameter that must be tuned to achieve the truly desired sparsity level at which ||a;||o = k. 
However, to avoid the trivial optimal solution x*{s) = 0, the parameter s must be restricted. First recall the 
well-known norm relations 

||2;||oo < ||3;||2 < ll^^lli) Vx s R". (6) 

Using the Holder inequalitjO, combined with inequalities ^ and ^, it follows that 



X Ax — s||x||o < ||^||oo||2;||i — s||x||o < (||^| 



F llo, 



for all X feasible for problem ^. Thus, to avoid the trivial solution, it is assumed that s G (0, ||^||oo)- Note 
that while s > ||j4||oo necessarily implies the trivial solution, taking s < ||^||oo does not guarantee we avoid 
it, but only gives a particular- bound. 

Likewise, a penalized version of the /i -constrained PCA problem dUl yields 

The /i -penalized PCA Problem 

maxjx ^x — s||x||i : ||x||2 < l,x G R""}. (7) 

Note that, as in problem dSjl, we need to restrict the value of the parameter s in order to avoid the trivial 
solution. Again, using the Holder inequality and Q, it is easy to see that 

x'^Ax - s\\x\\i < {\\A\\f - s)||x*||i, 

for all X feasible for problem (O, and hence it is assumed that s G (0, || A||i;'). 



For any u, u G R", the Holder inequality states that I (u, u) I < ||u||p||i)||q, wherep + g = pg,p > 1. 



Again, note that the formulated penahzed/relaxed problems remain hard nonconvex maximization prob- 
lems despite the convexity of their constraints. In fact, /o-penalized PCA and the /i-penalized version share 
an additional difficulty in that their objectives are neither concave nor convex. In Sections 15.31 and 15.41 we 
show how this difficulty is overcome. 

The Approximate /o-penalized PCA Problem 

The last approach for solving problem Q involves approximating the Iq norm in the objective of /o-penalized 
PCA. The idea of approximating the Iq norm by some nicer continuous functions naturally emerged from 
very well-known mathematical approximations of the step and sign functions (see, e.g., [6]). Indeed, it is 
easy to see that for any x G R", one can write 

n 

\\x\\o = ^sgn(|3;j|). 
1=1 

Thus, formally, we want to replace the problematic expression sgn(|t|) by some nicer function and consider 
an approximation of the form 

n 

Ikllo = limV<^p(|xi|), 
i=i 

where tpp : R+ — )• R+ is an appropriately chosen smooth function. Here, we consider the class of smooth 
concave functions which are monotone increasing and normalized such that (pp{0) = 0, 9^1(0) > 0. 

This suggests to approximate problem ^ by considering for p,s > the following approximate maxi- 
mization problem: 

n 

maxlx'^Ax - s^^ipp^Xil) : \\x\\2 < 1,3; G R"}. (8) 

1=1 

Approximations of the Iq norm have been considered in various applied contexts, for instance in the 
machine learning literature, see, e.g., Il24ll35l . 

There exist several possible choices for the function v?p(-) that approximate the step function, and for 
details the reader is refeiTcd to the classic book by Bracewell [61 Chapter 4]. For illusti"ation, here we give 
the following examples as well as their graphical representation in Figured] (left). 

Example 2 Concave functions (pp{-),p > 

(a) c^p(t) = (2/7r)tan-i(t/p), 

(b) ippit) = log(l + t/p)/ log{l + 1/p), 

(c) ^p{t) = {l+p/tr\ 

(d) ^pit) = 1 - e-'/P. 

The last example (d) was successfully used in the context of machine learning by Mangasarian |[24l . and 
gives 

¥?„ t =l-e"^" = S (9) 

^^' '^ 1 >0 ift/0. 



Approximations (pp{t) withp = .05 



Approximation as p ^ 




-•-(2/pi)arctan(t/p) 
"•"log(1+t/p)/log(1+1/p) 
-♦-1/(1 +p/t) 
1 -exp(-t/p) 




Figure 1 : The left plot shows four concave functions ipp (i) that can be used to approximate 1 1 a; 1 1 o ~ 
^7=1 fpi\^i\) f°i" fixed p = .05. The right plot shows how the concave approximation 1 — e^*/^ 
converges to the indicator function as p ^- 0. 



A nice feature of this example is that it also lower bounds the Iq norm, namely, we have 

n 

^iPp{\xi\)<\\x\\o, VxGR". 

See Figure [T] (right) for its behavior for various values of p. 

All problems listed in this section, as well as several other equivalent reformulations that will be de- 
rived in Section |5l will be solved by ConGradU, a conditional gradient algorithm with unit step size for 
maximizing a convex function over a compact set which is described next. 



3 The Conditional Gradient Algorithm 

3.1 Background 

The conditional gradient algorithm, is a well-known and simple gradient algorithm, see, e.g., 11211 [121 and 
the book H for a general overview of this method as well as more references. Note that this algorithm dates 
back to 1956, and is also known as the Frank- Wolfe algorithm |[T5l that was originally proposed for solving 
linearly constrained quadratic programs. 

The conditional gradient algorithm is presented here for maximization problems because of our interest 
in the sparse PCA problem. We first recall the conditional gradient algorithm for maximizing a continuously 
differentiable function F : R"^ — ;• R over a nonempty compact convex set C C R": 



max{F(x) : x G C}. 
The conditional gradient algorithm generates a sequence {x^} via the iteration: 



(10) 



e C, x^+^ = x^ + a\p> - x^), j = 0, 1, . . . 



where 

jp = argmax{(x — x^ ,'VF{x-')) : x G C}, (11) 

and where a^ G (0, 1] is a stepsize that can be determined by the Armijo or limited maximization rule 131 ■ 
It can be shown that every limit point of the sequence {x^} generated by the conditional gradient algorithm 
is a stationary point. Furthermore, under various additional assumptions on the function F and/or the set C 
(e.g., strong convexity of the function F and/or of the set C), rate of convergence results can be established, 
see in particular the work of Dunn llT3l and references therein. 

Clearly, the conditional gradient algorithm becomes attractive and simple when its main computational 
step (ITTI ) can be performed efficiently, e.g., when C is a bounded polyhedron it reduces to solving a linear 
program, or even better when it can be solved analytically. 

3.2 Maximizing a Convex Function via ConGradU 

As we shall see below, all potential reformulations of the sparse PCA problem will lead to maximizing a 
convex (possibly nonsmooth) function over a compact (possibly nonconvex) set C C R". It will be shown 
that, for such a class of problems, the conditional gradient algorithm will reduce to a very simple iterative 
scheme. First we recall some basic definitions and properties relevant to maximizing convex functions. 

Let F : R" — ;• R be convex, and C C R" be a nonempty compact set. Throughout, we assume that F is 
not constant on C. We denote by F'{x) any subgradient of the convex function Fatx which satisfies 

F{v) - F{x) >{v-x, F'{x)), Vu G R". (12) 

The set of all subgradients of F at x is the subdifferential of the function F at x, denoted by dF{x), i.e., 

dF{x) = {g : F{v) > F{x) + {v - x,g), Vw G R"}, 

which is a closed convex set. When F ^ C^, the subdifferential reduces to a singleton which is the gradient 
of F, that is dF{x) = {VF(x)}, and (IT2l ) is the usual gradient inequality for the convex function F. In 
the following, we use the notation F'(-) to refer to either a gradient or subgradient of F; the context will be 
clear in the relevant situation. 

The first result recalls two useful properties when maximizing a convex function (see |[28l Section 32]). 

Proposition 3 Let F : R" -^ R be convex, S C R" be an arbitrary set and let conv{S) denote its convex 
hull. Then, 

(a) sup{F(x) : x G conv{S)} = sup{F(x) : x G S}, where the first supremum is attained only if the 
second is attained. 

(b) If S C R" is closed with nonempty boundary bd{S), and F is bounded above on S, then sup{-F(3;) : 
X G 5} = sup{F(x) : x G bd{S)]. 

The next result gives a first-order optimality criterion for maximizing a convex function F over a com- 
pact set C G R". It uses property (a) of Proposition |3] and follows from |[28l Corollary 32.4.1]. 

Proposition 4 Let F : R" -^Kbe convex. If x is a local maximum of F over the nonempty compact set C, 
then 

{FOC) {v-x,F'{x)) <0, yveC. (13) 



10 



When F G C^, a point x G C satisfying the first-order optimahty criteria (FOC) (fT3] ) will be referred 
as a stationary point, otherwise, when F is convex nonsmooth, we will say that the point x £ C satisfies 
(FOC). 

We ai^e now ready to state the algorithm. It turns out that when the function F is convex and quadratic, 
we can also eliminate the need for finding a step size and consider the conditional gradient algorithm with a 
fixed unit step size and still preserve its convergence properties. This simplified version of the conditional 
gradient algorithm will be used throughout the paper and referred to as ConGradU. 

Algorithm 1 ConGradU - Conditional Gradient Algorithm with Unit Step Size 
Require: x° G C 



while stopping criteria do 

x^^^ G argmax{(x — x^ ,F'{x^)) : x G C} 
end while 
return x^ 



To analyze ConGradU, in what follows, it will be convenient to introduce the quantity 

j{x) := max{(u - x, F'{x)) : v £ C} (14) 

for any x G R". Since C is compact, this quantity is well-defined and admits a global maximizer 

u{x) G argmax{(t; — x,F'{x)) : v G C}, 

and thus, we have 7(x) = {u{x) — x, F'{x)). 

In terms of the above defined quantities, we thus have that x* satisfies (FOC) is equivalent to saying that 
X* is a global maximizer of 'ip{v) = {v — x*, F'{x*)), i.e., 

X* G argmax{(z; — x* , F'{x*)) : v G C} = u{x*). 

Thus the ConGradU algorithm is nothing else but a fixed point scheme for the map u{-), and simply reads 
as 

x° G C, x^+^ =u{x^), j = 0,l,.... 

Lemma 5 Let F : R" — )• R Z?e convex, C C R" be nonempty and compact, and let 7(x) be given by f l7?l ). 
Then, 

(i) ■y{x) > for all x £ C. 

(ii) For any v £ C and any x £ R", 

F{u{x)) - F{x) > 7(x) >{v- x,F'{x)). 

Proof. The proof of (i) and the right inequality of (ii) follows immediately from the definition of 7(x), while 
the left inequality in (ii) foUows from the subgradient inequality for the convex function F: 



F{u{x)) - F{x) > {u{x) - x,F'{x)) = 7(x). 

We are ready to state the convergence properties of ConGradU. 
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Theorem 6 Let F : R" -^ Vibe a convex function and let C C R"^ be nonempty compact. Let {x^} be the 
sequence generated by the algorithm ConGradU. Then the following statements hold: 

(a) the sequence of function values F{x^) is monotonically increasing and 

lim 7(x-') = 0. 

j->-oo 

(b) If for some j the iterate x^ satisfies 'y{x^) = 0, then the algorithm stops with x^ satisfying (FOC). Oth- 
erwise the algorithm generates an infinite sequence {x-^} with strictly increasing function values {F{x^)}. 

(c) Moreover, if F is continuously differentiable, then every limit point of the sequence {x-^} converges to a 
stationary point. 

Proof. By definition of the iteration of ConGradU, using our notations, the sequence {x^} is well defined 
via x^~^^ G u{x^),\/j = 0, 1 . . . . Invoking Lemma|5l we obtain 

< 7(xJ) < F{x^+^) - F(x^), Vi = 0, 1, . . . , 

showing that the sequence {F{x^)} is monotone increasing. Summing the inequality above for j = 

0, . . . , A^ — 1, we have 

^7(xJ')<F(x^)-F(xO). 
j=o 

Since C is compact and F(-) is continuous, we also have F{x^) < max{F(x) : x G C} := F*, and 
thus it follows from the above inequality that ^j=q jix^) < F^, — F{x^), and hence the nonnegative series 
Z^^o 7(^''^^) i^ convergent so that 7(x-') converges to 0. Now, since 7(x-') > for all j = 0, . . ., then if for 
some j the iterate x^ is such that 7(x'') = then the procedure stops at iteration j with x^ satisfying (FOC). 
Otherwise, 7(x-') > and the iteration generates an infinite sequence with F(x-^+^) > F{x^). In the latter 
case, assuming now that F G C^, we will now prove the last statement of the theorem. Since C is compact, 
the sequence {x-^} C C is bounded. Passing to subsequences if necessary, for any limit point x°° of {x-'} 
we thus have x^ — )• x°°. Without loss of generality we let u{x^) — )■ u. Using the facts 

7(xJ') = {u{x^) - x\F'{x^)) and {v - x\F'{x^)) < (n(x^') - x^ F'(x-'')), Vv G C, 

and since ^{x^) — )■ and F G C^, passing to the limit over appropriate subsequences in the above relations, 
it foUows that {u - x~, F'(x°°)) = and {v - x°°, F'{x°°)) < {u - x°°, F'(x~)), \/v G C, and hence 
{v — x°°, F'{x°°)) < 0, V-y G C, showing that x°° is stationary. ■ 

Remark 7 (a) For the case of convex F and bounded polyhedron C, as noted in the introduction, Man- 
gasarian l{24\l seems to have been the first work suggesting the possibility of using a unitary step size in 
the conditional gradient scheme and proved that the algorithm generates a finite sequence (thanks to the 
polyhedrality of C) that terminates at a stationary point. 

(b) Very recently, the use of a unitary stepsize in the conditional gradient scheme was rediscovered in 
M9^ with C being an arbitrary compact set, which is identical to Algorithm [7] 

(c) The proof of Theorem |6| is patterned after the one given in H24\l . Note that part (a) also follows 
from fl9\l as well. Furthermore, under stronger assumptions on F and C, / [79l Theorem 4] also established 
a stepsize convergence rate giving an upper estimate on the number of iterations the algorithm takes to 
produce a step of small size. 

(d) Finally, as kindly pointed out by a referee, the proof of convergence could also probably be derived 
by using the general approach developed in the classical monograph ofZangwill / I57I/ . 
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We end this section with a particular realization of ConGradU for an interesting class of problems given 
as 

(G) max{f{x)+g{\x\):xeC} 

X 

where 

/ : R" — )• R is convex, 

g : R" — )■ R is convex differentiable and monotone decreasing^, 

C C R" is a compact set. 

Our interest for this form is mainly driven by and is particularly useful for handling the case of the 
approximate /o-penalized problem (cf. Section 1531 ) . which precisely has the form of this optimization model 
with adequate choice of the kernel tpp that can be used to approximate the /q norm (cf. Section |2l). It will 
also allow for developing a novel simple scheme for the /i-penalized problem (cf. Section [5^ . 

Note that under the stated assumptions for g, the composition g{\x\) is not necessarily convex and thus 
ConGradU cannot be applied to problem (G). However, thanks to the componentwise monotonicity of g{-), 
it is easy to see that problem (G) can be recast as an equivalent problem with additional constraints and 
valuables as follows: 

{GG) max{/(x) + g{y) :\x\<y,x€ C}. 

Note that, without loss of generality, an additional upper bound can be imposed on y in order to enforce com- 
pactness of the feasible region of problem (GG) (e.g., by setting the upper bound on y as y^ := argmax{|j;| : 
X G C}), but this need not be computed in order to establish our following result. Clearly, the new objective 
of (GG) is now convex in (x, y) and thus we can apply ConGradU. We show below that the main iteration in 
that case leads to an attractive weighted /i-norm maximization problem, which, in turn, as shown in Section 
|4]can be solved in closed form for the cases of interest, namely when C is the compact set described by the 
unit sphere or unit ball. 

Proposition 8 The algorithm ConGradU applied to problem (GG) generates a sequence {x^} by solving 
the weighted li-norm maximization problem 

x^ e C, x^^^ = argmaxKa-', x) — 2_]'^l\^i\ '■ ^ ^ C},j = 0, . . . , (15) 

i 

where w^ = -^'(Ix^l) > and a^ = f'{x^) G R^ 

Proof. Applying ConGradU to the convex function H{x,y) := f{x) + g{y), with y^ = \x^\^x'^ G C we 
obtain, 

(x-'+Sy^+^) = aiguiaK{{x-x\f'{x^)) + {y -y^,g'{yJ)) -.x eG,\x\<y} 
x,y 



argmax < {x — x^ , f'{x^)) + max{(y — y-' ,g'{y-^)) : \x\ < y}? 
x&c [ y J 

argmax {(x-x^/'(j;^)) + (|x| -y^,g'{y^))} , 

x&C 



'By monotone decreasing, we mean componentwise, i.e., eachi component of g'{-) the gradient of g is such that gi{-)i < 
Vi G H. 



13 



where the last max computation with respect to y uses the fact that g' is monotone decreasing. It is also clear 
that, given the initialization y" = \x^\ and y^'^^ = argmax{(y, (/'(y-')) : \x^~^^\ < y}, we have y^ = \x^\ 
for all j = 0, 1, . . . Omitting constant terms, the last iteration can be simply rewritten as 

x^~^^ = argmax{(x, /'(x-^)) + (|x|,y'(|x-'|))}, 
which with w^ := —g'{\x^) > proves the desired result stated in (fTSl ). ■ 



4 A Simple Toolbox for Building Simple Algorithms 

The algorithms discussed in this paper are based on the fact that the /q -constrained PCA problem as well 
as many of the penalized and constrained Iq and /i optimization problems that are solved by the proposed 
conditional gradient algorithm have iterations that either have closed form solutions or are easily solvable. 
Specifically, the main step of ConGradU maximizes a linear function over a compact set, and the following 
lemmas and propositions show that for certain compact sets (e.g {x € R" : ||x||2 = 1, ||2;||o < k} and 
{x G R" : ||x||2 = 1, ||x||i < A;}), these subproblems are easy to solve. 

In addition, propositions are given that will be used to reformulate (in Section |5]l problems such as Iq 
and /i-penalized PCA, which have neither convex nor concave objectives, to problems that maximize a 
convex objective function. The propositions for maximizing linear functions over compact sets can then be 
used in ConGradU as applied to the reformulated Iq and /i-penalized PCA problems. Combined with the 
conditional gradient algorithm discussed above, these propositions are the only tools required throughout 
the remainder of the paper. 

We start with an obvious but very useful lemma that is used in all of the following propositions. 

Lemma 9 Given / a G R", 

max{(a,x) : ||3;||2 = l,x £ R"} = max{(a, x) : ||x||2 < l,x G R"} = ||a||2, 
with maximizer X* = a/||a||2. 
Proof. Immediate from Cauchy-Schwarz inequality. ■ 

The following propositions make use of the following operator: 
Definition 10 Given any a G R", define the operator Tfc : R" — )• R" by 

Tk{a) := argmin{||x — a\\\ : ||x||o < fc,x G R"}. 

X 

This operator is thus the best A;-sparse approximation of a given vector a. Despite the nonconvexity of 
the constraint, it is easy to see that {Tf^{a))i = ai for the k largest entries (in absolute value) of a and 
{Tk{x))i = otherwise. In case the k largest entries are not uniquely defined, we select the smallest 
possible indices. 

In other words, without loss of generality, with a G R" such |ai| > ... > |a„|, we have 



{Tk{a))i 



0, otherwise. 
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Computing Tfc(-) only requires determining the k^'^ largest number of a vector of n numbers which can be 
done in 0{n) time m and zeroing out the proper components in one more pass of the n numbers. 

The next proposition is an extension of Lemma |9] to /q -constrained problems. This is the simple result 
that maximizing a linear function over the nonconvex set {x G R" : ||x||2 = 1, ||x||o < k} is equally simple 
and can be solved in 0{n) time. 

Proposition 11 Given 7^ a G R", 

max{(a,x) : ||x||2 = 1, ||x||o < k,x £ R"} = ||Tfc(a)||2 (16) 



with solution obtained at 



Tk{a) 
\Tk{a)\\ 



Proof. By Lemma |9l the optimal value of problem (fT6l ) is \/Yliex ^f ^'^^ some subset of indices I C [n] 
with \I\ < k. The set I that maximizes this value clearly contains the indices of the k largest elements of 
the vector \a\. Thus, by definition of Tk{a), solving problem (IT6l) is equivalent to solving 

max{(x,Tfc(a)) : ||2;||2 = l,x £ R"} 

X 

from which the result follows by Lemma|9l ■ 

Another version of this result gives the solution with a squared objective. 
Proposition 12 Given a G R", 

max{(a, x) :||x||2 = l,||2;||o<A;,xG R'"} = ||7fc(a)||2 

X 

with solution obtained at 

Tk{a) 



\\Tk{a)\W 

Proof. Notice that the optimal objective value in Proposition [TT] is nonnegative, and hence the squared 
objective value here does not change the optimal solution to the problem with linear objective. The result 
follows. ■ 

The above propositions will serve to derive the novel schemes given in Section |5] The next result is 
a modification of Proposition [121 It shows that the /o-penalized version of maximizing a squared linear 
function yields a closed-form solution. 

Proposition 13 Given a G R", s > 0, 

n 

max{(a,x) — sllxllo : llxlb = l,x G R"} = > (■ 

X ^ ' 

i=l 

is solved by 

ai[sgn{a} - s)]+ 



a? 



T.Uci][sgn{a]-s)]. 
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Proof. Assume without loss of generality that |ai| > ... > |a„|. The problem can be rewritten as 

max{— sp + max{(a, x)^ : ||x||2 = 1, ||x||o < p, x G R"}}- 

pG[n] X 



Using Proposition [T2j the inner maximization in x is solved at 



[ 0, otherwise, 

and the problem is equal to 

p p n 

max{-sp+ \\Tp{a)\\l} = max{ Va^ - sp} = max{ V (af - s)} = V (a^ _ s)^. 
pe[n] peln] ■^ pe[n] ■^ ^ 

Notice that the optimal p is the largest index i such that a^ > s which makes the above expression for x* 
equivalent to the expression in the proposition. ■ 

Our next two results are concerned with /i -penalized/constrained optimization problems. First, it is 
useful to recall the following well-known operators which are particular instances of the so-called Moreau's 
proximal map [,26 ] ; see, for instance. 111 for these results and many more. Given a G R" and W ann x n 
diagonal matrix W = diag(tt;), u; G R"^ with positive entries Wi, let 

n 

\\Wx\\i:=Y,Wi\xi\; B^ := {x G R" : ||Ty"^x||oo < !}• 

i=l 

Then, 

Sw{a) '■= argmin{-||x — a||2 + ||VFx||i} = (|a| — w)+sgn(a), (17) 

nB™(a) := argminjllx — a||2 : X G B'J^} = sgn(a)min{i(;, |a|} = a — S'^(a), (18) 

X 

where 5^ (a) and IIb™ (a) are respectively known as the soft-thresholding operator and the projection oper- 
ator. 

Proposition 14 For a G R*^, w G R++, and W = diag{w) 



max{(a,x) — ||M^x||i : ||x||2 < l,x G R"} 

which is solved by 

x* = 5^(a)/||5^(a) 



\ 



^(|ai| - Wi)+ = \\Swic 



i=l 



Proof. By the Holder inequality, we have || VFx||i = max||„||^<i(u, Wx) = max{(z, x) : z e B^}. Using 
the latter, we obtain 

max|(a,x) — ||VFx||i : llxlU < 1| = max min(a — 2;,x) 

||x||2<i^eB- ' 

= min max (a — z, x) 

2GB» ||a;||2<l 

= min{||a-^||2:zGB^} = ||5^(a)||2, 
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where the second equaUty follows from standai^d min-max duality ll28l . the third from Lemma |9j and the 
last one from using the relations (fT7])-(fT8]). where the optimal z* = a — Sw{a) and x* follows from Lemma 
i ■ 

We next turn to the Z2/^i-constrained problem, 

max{(a, x) : ||x||2 < 1, ||2;||i < k,x e R"}, (19) 

and state a result similar to Proposition [TT] for maximizing a linear function over the intersection of the I2 
unit ball with an li constraint. While maximizing over the intersection of the I2 unit ball with an Iq ball has 
an analytic solution, here we need an additional simple one dimensional search to express the solution of 
(IT9I) via its dual. 



Proposition 15 Given a G R", we have 

max{(a, x) : \\x\\2 < 1, ||x||i < k,x £ R"} = minjAA; + IISAela)!^}) (20) 

the right hand side being a dual ofl\19]). Moreover, if X* solves the one-dimensional dual, then an optimal 
solution of(il9i is given by x*(A*) where: 

x*(A) = Sxeia)/\\Sxeia)h, (e = (1, . . . , 1) G R"). (21) 

Proof. Dualizing only the ^i constraint, standard Lagrangian duahty ll28l implies: 

max{(a,x) : ||x||2 < 1, ||x||i < k,x G R""} = min{A/c + ipiX) : A > 0}, 

with 

V'(A) := max {{a,x) - \\\x\\i} = ||5'Ae(a)||2 

|x||2<l 

where the last equality follows from Proposition [14] with x*(A) as given in (|2TI ). ■ 

The above propositions will be used to create simple algorithms for Iq and Zi -constrained and penalized 
PCA. 

5 Sparse PCA via Conditional Gradient Algorithms 

This section details algorithms for solving the original /q -constrained PCA problem Q and its three modifi- 
cations (IDl, (111), and ^. Everything is developed using the simple tool box from SectionlH We derive novel 
algorithms as well as other known schemes that are shown to be particular realizations of the conditional 
gradient algorithm. A common and interesting appeal of all these algorithms is that they take the shape of a 
generalized power method, i.e., they can be written as 



x^+i 



S{Axi 



\\S{AxJ)\\, 



where S : R" — )• R" is a simple operator given in closed form or that can be computed very efficiently. 

Table [T] summarizes the different algorithms that are discussed throughout this section. Except for the 
alternating minimization algorithm for Zi -penalized PCA of BTI . they are all particular realizations of the 
ConGradU algorithm with an appropriate choice of the objective/constraints {F, C). 
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Type 


Iteration 


Per-Iteration 
Complexity 


References 


^o-constrained 


, Tfc((A+f/„)x^) 
•^ ||T&((A+f/„)2;J)||2 


0{kn),0{mn) 


novel 


^1 -constrained 


^j + 1 Sgn{{{A+§)x^)i)i\{{A+§}x'),\-X^) + 
VEh(l(('4+f):^^)h|-A^)^ 


Oin^),0{mn) 


136| 


^1 -constrained 


s^ is (fc + l)-largest entry of vector \Ax^ 


0{n^),0{mn) 


130) 


^o-penalized 


.7+1 _ Ejsgn((br^^y-s)]+(fcr^')''. 


0{mn) 


|29||19| 


^0 -penalized 


^j + l Sgn{2{Ax'),){\2{Ax^)i\~Sip'^{\xi\)) + 

^j:f,mAx^)H\-sv'p{\xi\))i 


0{n^) 


131| 


^1 -penalized 


yJ+1 = argmin^ {^, ||fe, - x^y^6,||^ + A||y||2 + s\\y\\i} 

^,+i_ (E.&.brV+^ 
lt(E,fc.fcr)y^+i||2 


See Section|i.4| 


ED 


li -penalized 


^j + l Sgn(((A+f):r^)0(|((A+f)x^).|-s) + 


0(n2),0(mn) 


novel 


^1 -penalized 


.7+1 Ei(|br-^'l-s)+sgn(b^z')6i 
IIE.(|fc.''^^|-'<)+sgn(6,^'z:')b,||2 


0[mn) 


12911191 



Table 1: Sparse PCA Algorithms. For each iteration, B e R™^" is a data matrix with A = 
B^B. hi is the z"* column of B (except for the ^i-penalized PCA of ^A\\ where it is the i*'' row 
of B). For iterations with two complexities, the first uses the covariance matrix A and the second 
uses the decomposition A ~ B^B to compute matrix-vector products as Ax or B'^{Bx). Several 
iterations have two complexities, depending on whether data matrix B is available. The regularized 
/i -constrained version of | l36| is also novel. The Iq and ^i-penalized iterations of 1291 require an 
0{mn) transformation to recover a sparse x* from z* . 



We stress that the first algorithm for /q -constrained PCA is the only algorithm that applies directly to 
the original and unmodified Zq -constrained PCA problem. The other algorithms are applied to modified 
problems where tuning a parameter is needed to get an approximation to the desired fc-sparse problem. 
Unless otherwise specified, A is only assumed to be symmetric and Afj is used to denote the regularized 
positive definite matrix. The exceptions are only when we assume we aie given a data matrix B so that 



18 



A = B^B. 

5.1 /o- Constrained PCA 

In this section, we focus on the original Zq -constrained PCA problem 

niax{x Ax : \\x\\2 = 1, \\x\\o < k,x £ R"}. (22) 

We apply the ConGradU algorithm with the constraint set C = {x G R*^ : ||x||2 = 1, ||a;||o < ^} and the 
convex objective (cf. Section |2]): 

qa{x) = X {A + al)x = x A^^x, a > 0. 

When A G 5" is already given positive semidefinite, the objective is already convex and there is no need 
for regularization and thus we simply set a = 0. The resulting main iteration of ConGradU reduces to, 

x^'+i = argmax {(x, A^x^) : ||x||2 = 1, ||x||o <k,xeR^}= . Jf;^"'^,^.. , J = 0, 1, . . . (23) 

with x^ £ C and where the second equality is due to Proposition [TT] 

This novel iteration is obtained by maximizing a continuously differentiable convex function over a 
compact nonconvex set, and by Theorem [6l every one of its limit points converges to a stationary point. 
The complexity of each iteration requires computing Aa-x^ where A^^ G s"X" and x^ is A;-sparse so the 
matrix-vector product requires 0{nk) complexity. Computing the T'fc(-) operator is 0{n) so each iteration 
is 0{nk). For very large problems where only a data matrix B G R"^^" can be stored, the matrix- vector 
product AfjX^ is computed as B'^{Bx^), which requires complexity 0{mn), so each iteration is actually 
0{'mn). The new scheme based on iteration (1231 ) is an extremely simple way to approach the original 
^o-constrained PCA problem (l22l ) for any given matrix A^ G 5" and is the cheapest known non-greedy 
approach to /q -constrained PCA. 

Thus, a very simple gradient algorithm (and not greedy heuristics) can be directly applied to the desired 
problem. To put our novel scheme in perspective, let us recall the work ll25l which offers the following 
greedy scheme. Given a set of k indices for variables with nonzero values, n — k subsets of indices are 
created by appending the k indices with one from the n — k remaining indices. Then n — k possible matrices 
are computed from the n — k groups of indices and the matrix with the maximum eigenvalue gives the index 
that provides the k + 1-sparse PCA solution. A full path of solutions can be computed for all values of 
k, but at a costly expense 0{n^) (or up to A:-sparsity in 0{kn^)), and with no statements on stationaiity 
that we have. The work |[25l also produces a branch-and-bound method for computing the exact solution, 
however, this method is amenable to only very small problems. The authors of ifTOl considered what they 
call an approximate greedy algorithm that generates an entire path of solutions up to /c-sparsity in 0{kv?) 
(k G {1, • • • ,n}). Rather than computing the n — k maximum eigenvalues each iteration, their scheme 
computes n — k dot products each iteration. While the path is cheap for all solutions, it is expensive when 
only the /c-sparse solution is desired for a single value k. The total path (up to A;-sparse solutions) using our 
computations can be computed in 0{kmn) (assuming finite convergence). While the (approximate) greedy 
algorithms are more computationally expensive, as will be shown in Section [6l they do offer good practical 
performance (under measures to be discussed later). 

Finally, we again stress the importance of considering such simple approaches where we can only pro- 
vide convergence to stationary points. Recent convex relaxations for this problem |[mi22l . while offering 
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new insights to the problem, have the same disadvantage that the gap to the optimal solution of problem (I22b 
cannot be computed (together, these methods can give primal-dual gaps). Only a gap to the optimal solution 
of a relaxation (convex upper bound) is computed. Another major disadvantage is that convex relaxations 
are not amenable to very large data sets as the per-iteration complexity is 0{n'^) and they require far more 
iterations in practice. Application of our scheme is limited only by storage of the data; only nk entries of 
the covariance matrix are needed at each iteration. 

5.2 /i -Constrained PCA 

In this section, we focus on the Zi -constrained PCA problem 

max{x'^Ax : ||x||2 < 1, ||x||i < \^,x G R"}. (24) 



In 112211 . we provide a convex relaxation and corresponding algorithm with a per-iteration complexity of 
O(n^), but here, we are again only interested in much less computationally expensive methods. We present 
two recent algorithms that have been proposed through different motivations and show that they can be re- 
covered through our framework. 

An Alternating Maximization Sclieme 

Recently, Witten et al. |[36l have considered the sparse Singular Value Decomposition (SVD) problem 



max{x'^By : \\x\\2 = 1, II2/II2 = 1, lkl|i < h, \\y\\i < k2,x,y G R"}, 
x,y 

where B G r^x" j^ t^g jata matrix and ki, k2 are positive integers. They recognized that the objective 
is linear in x with fixed y (and vice-versa) and that the problems with x or y fixed can be easily solved 
(see Proposition [15]). This fact motivates them to propose a simple cheap alternating maximization scheme 
which in turn can be used to solve /i -constrained PCA. However, the work |[36l does not recognize this as 
yet another instance of the conditional gradient algorithm nor does it provide any convergence results. We 
show below how to simply recover this algorithm under our framework by applying ConGradU directly to 
problem ((24l) for which the convergence claims of Theorem [6]hold. 

Thanks to the results established in Section |2l we derive it here with the additional regularization term, 
i.e., with qa{x) = x'^A^^x, cr > for an arbitrary matrix A £ S"^ (with a = when A is already given 
positive semidefinite). Applying ConGradU, the scheme reads: 



x-^ ' " = argniax{(y4o-a;,a;-') : ||a;||2 < 1, ||x||i < yk^x G R"}. (25) 



.i+i 



The above iteration maximizes a continuously differentiable convex function over a compact set so Theorem 
|6]can be applied to show that every limit point of the sequence converges to a stationary point. Furthermore, 
thanks to Proposition [151 this scheme reduces to the iteration 

i+i _ S^,,{A„x^) 

where A-' is determined by solving (l20l ) with a := A^jX^ (cf. Proposition [TS]). The most expensive operation 
at each iteration is computing A^jX^ where Afj G S" and x^ G R" so the per-iteration complexity is 0{r?^. 
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The Expectation-Maximization Algorithm 

Another recent approach to problem ((24l) is developed in ll30l which they motivate as an Expectation- 
Maximization (EM) algorithm for sparse PCA. Motivation comes from their derivation of computing prin- 
cipal components using EM for a probabilistic version of PCA, which is actually equivalent to the power 
method. The authors solve Zi -constrained PCA, however also want to enforce that ||x||o = A; at each iteration 
as well. Their algorithm can be written as 

\\S^,,{A„xi)\\2 

where A-' is the {k + l)-largest entry of vector | ActX-'|. Note that the iteration form is identical to that above 
for the alternating maximization scheme, except for the computation of A^. Thus, each iteration can be 
interpreted as solving 



X- 



j+i 



argmax{(^o-x,x-') : ||x||2 = 1, ||a;||i < vkJ ,x G R""}, (27) 



where y is chosen at each iteration specifically so that x^~^^ is A;-sparse, and can easily be seen to be a 
variant of ConGradU. Enforcing A-' to be the {k + 1) -largest entry of vector l^tjX-'l implicitly sets k^ to a 
value that achieves fc-sparsity in x^^^. While this choice of thresholding enforces exactly k nonzero entries, 
the iteration becomes heuristic and neither applies to the true Iq or Zi -constrained problem. It is cheap, 
with the major computation being to compute A^jX^ , and performs well in practice as shown in Section |6] 
However, unlike our other iterations, there are no convergence results for this heuristic. 

5.3 /o-Penalized PCA 

We next consider the /o-penahzed PCA problem 

maxjx^Ax - s||x||o : ||x||2 < 1,3; G R"} (28) 

which has received most of the recent attention in the literature. We describe two recent algorithms to this 
problem and show again that they are direct applications of ConGradU. 

Exploiting Positive Semidefiniteness of A 

The first approach due to |[T9l assumes that A is positive semidefinite (i.e., A = B^B with B G r"^x") and 
writes problem ( |28] | as 

max{||53;||^-s||x||o : ||x||2 < l,x G R*"}. (29) 

The objective is neither concave nor convex. First, using the simple fact (consequence of Lemma |9ll 



|5x||2 = max{(z,i?x) }, 

lkl|2<l 



the problem is equivalent to 



max max{(2;,i?x) — s||x||o} = max max {(i? z,x) — s||x||o}. 

I|a;||2<l lkl|2<l Il-2||2<1 ||3::||2<1 

Thus, we can now apply Proposition [T3]to the inner minimization in x and then get 

n 

max {||i?x||2 — s||x||o : ||x||2 < 1} = max {> [{bi,z)'^ — s]+ : ||z||2 < 1} (30) 
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where bi G R™ is the i^^ column of B. This reformulation was previously derived in lITOl . where the authors 
also provided a convex relaxation. Note that the reformulation operates in the space R"* rather than R". 
Since the objective function f{z) := Yliii^i^^)^ ~ ^]+ i^ ^^^ clearly convex, we can apply ConGradU. 
Noting that a subgradient of f{z) is given by 

n 

2j^[sgn((6i,z)2-s)]+((6„z))6i, 
i=l 

the resulting iteration (using Lemma |9l) yields: 



.i+i 



Y:d^gn{{{b,,zi)r-s)U{{h,z^))b 



(31) 



•i\\2 



and the convergence results for the nonsmooth case of Theorem [6] apply. 

This is exactly the algorithm recently derived in llT9l . Note that an 0{mn) transformation is then needed 
via Proposition [T3] to recover the solution x of the original problem ( |29l ). This is the first cheap (0{mn) 
per-iteration complexity) and nongreedy approach for directly solving the /q -penalized problem. As with 
ll36l for /i -constrained PCA, ||29l approach /o-penalized PCA via /o-penalized SVD. After modifications to 
write it out for /o-penalized PCA, the resulting iteration of their paper is equivalent to iteration (|3TI ). How- 
ever, they did not offer a derivation or state the convergence properties given in |[T9l . 



Approximating the /q -penalized Problem 

As explained in Section |2l we consider the ^o-penalized problem whereby we use an approximation to the 
Iq norm, that is, we consider the problem of maximizing a convex function: 

ma.x{x^A„x + g{\x\) : \\x\\2 < l,x G R"} (32) 

X 

where the convex function g is defined by 

n 
g{z) := -S^V9p(Zi), s > 0, 
i=l 

with ipp concave satisfying the premises given in Section|2]and A^^ = A + al, a > 0. Applying ConGradU 
to this problem, as shown in Section [3l using Proposition [8] reduces the iteration to the following weighted 
Zi-norm maximization: 



x^+^ 



argmaxK^CT^;,^;"') — /,'^i\^i\ ■ ll^lb = 1} (33) 



where wl = sipp{\xl\). 

Proposition [14] shows that this problem can be solved in closed form so that the conditional gradient 
algorithm becomes 

j + l _ ^wj[^a-^ ) 

~ \\S^j{A^xi)\\2 

where again wj = s(pp{\xl\). Theorem |6] regarding convergence of every limit point of the resulting se- 
quence to a stationary point again applies. The per-iteration complexity of this iteration is 0{v?), which is 
reduced to 0{mn) if we have the factorization Afj = B^B. 
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Depending on the choice of ip (cf. Section |2l), we thus have a family of algorithms for solving problem 
\. For example, with the Example [2] (b) given in Section |2l we obtain the recent algorithm of OTl which 
was derived there by applying what is called the minorization-maximization method, a seemingly different 
approach; they consider ^ G S" and represent the objective as a difference of convex functions plus the 
penalization: x^ Ax — s X]j V'pd^il) = x'^ A^x — ax'^x — sYlii^p{\xi\)- The objective is minorized by 
linearizing the convex term x'^ A^x — s ^^ (/9p(|xj|) resulting in a concave lower bound that is maximized. 
When cr = 0, this is identical to using the conditional gradient algorithm ConGradU, which is only one 
example of a minorization-maximization method. For more on the minorization-maximization technique 
and its connection to gradient methods see the recent work ||3l and references therein. We also note that OTl 
derived their algorithm for the sparse generaUzed eigenvalue (GEV) problem 

mayi{x^ Ax : x^ Bx < 1, ||x||o <k,x£ R"} (34) 

where ^4 G S"^ and B G S"^, which includes as a special case the sparse PC A problem when B is the 
identity matrix. The resulting algorithm of |[3TI for this problem requires computing a matrix pseudoinverse, 
and is much more computationally expensive (and not amenable to extremely large data sets) than the same 
algorithm for sparse PCA. Moreover, using the results of Section|2j clearly the general iteration for indefinite 
A need not be considered and sparse GEV can always be approached with a closed-form conditional gradient 
algorithm which still requires computing a matrix pseudoinverse (the closed-form iteration is derived in 

lEl). 



5.4 /i -Penalized PCA 

Consider the li-penaUzed PCA problem 

maxjx ^x — s||x||i : ||3;||2 = 1,3; G R"}. (35) 



This problem has a nonconvex objective. The work jlll provides a convex relaxation that is solved via 
semidefinite programming with a per-iteration complexity of 0{n^), but here, we are again only interested 
in much less computationally expensive methods. We describe two methods that exploit positive semidefi- 
niteness of A, along with a novel scheme that does not require ^ G S^. 

Reformulation with a Convex Objective 

To apply ConGradU, we need either a convex objective or, as shown, an objective of the form /(x) + 5f(|x| ) 
with /, g satisfying certain properties (cf. Section|2l). Exploiting the fact that A G 5", with A = B^B, B G 
jjmxn^ an equivalent reformulation of the original ^o-constrained PCA problem can use the square root 
objective ||i?x||2, and hence the corresponding /i-penalized PCA problem reads, instead of (1351 ). as 

max{||i?x||2 — •s||x||i : ||x||2 < l,x G R"}. 

One can think of this as replacing the objective in our original /o-constrained PCA problem ^ with 
Vx'^Ax = \\Bx\\2 (which is an equivalent problem) and then making the modifications for /i-penalized 
PCA. The objective remains problematic and is neither convex nor concave. However, using again the fact 
that ||-Bx||2 = max{(2;, Bx) : \\z\\2 < 1, ^; G R™}, the problem reads 

max{||i?x||2 — s||x||i : ||x||2 < l,x G R""} = max max {(z, Sx) — s||x||i}. 

|z||2<l ||^i|2<l 
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Thus, applying Proposition [T4l the inner maximization with respect to x can be solved explicitly and the 
problem can be reformulated as maximizing a convex objective, and we obtain: 

n 

max {||i?x||2 — slklli : lla^lb < 1| = niax | > (\b, z\ — s), : \\z\\2 < 1}, 

Rn ^'' " " " " " -" 1>"^ Z ^ ~^ 

zgK. , 

2 = 1 

where b-i is the i^^ column of B. 

We can now apply ConGradU to the convex (for similar reasons as for the /g-penahzed case in Section 
15.31 ) objective f{z) = Yliilbf z\ — s]\, and for which our convergence results for the nonsmooth case hold 
true. A subgradient of / is given by 



2^i\bjz\-s)+sgn{bfz)h, 



i=l 



and, using Lemma |9l the resulting iteration is 



.+!_ E^i\bIz^-sUsgnibJz^)h 



\\E^{\bIz^-shsgn{bfz^)hh■ ^ ^ 



This is exactly the other algorithm recently derived in 11191 . Note that to recover the solution to problem 
(IT4l ) needs an 0{mn) transformation via Proposition [141 This algorithm has an 0{mn) per-iteration com- 
plexity. Note also that this algorithm was stated earlier in ll29l but no such derivation or convergence results 
were given. 

A Novel Direct Approach 

We next derive a novel algorithm for problem (1351 ) by directly applying the conditional gradient algorithm. 
Indeed, problem (1351 ) reads as maximizing f{x) + g{\x\) with f{x) convex, g{x) convex, differentiable, 
and monotone decreasing, with f{x) = x^ A^x and g{u) = — ^^ Uj where A„ is as previously defined. 
Applying the ConGradU algorithm and Proposition [8] leads to the iteration 

x^^ = argmaxK^o-^;-',^;) — sHxHi : ||x||2 = 1}, (37) 

which by Proposition [l4]reduces to 

where e is a vector of ones. Theorem [6] applies, showing that any limit point of this iteration is a stationary 
point of the Zi -penalized PC A problem. 

The matrix- vector product Ax^ is the main computational cost so the per-iteration complexity is 0{n?') 
(or 0{mn) if computing B'^{Bx^)). This approach can handle matrices A that are not positive semidefinite 
(by taking o" > 0) and has stronger convergence results than the conditional gradient method applied to the 
reformulation of |[T9l . i.e., this approach is equivalent to applying ConGradU with a differentiable objective 
function (by Proposition [T4l) and thus satisfies part (c) of Theorem |6] [il9il apply ConGradU to a different 
nondifferentiable formulation for which our theory does not apply. 

For the sake of completeness, we end this section by mentioning one of the earher cheap schemes for 
sparse PCA, even though it does not fall into the category of directly applying ConGradU. 
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An Alternating Minimization Scheme 

One of the earlier cheap approaches to sparse PCA, specifically for /i-penalized PCA, is proposed in pT] 
(SPCA). While they generalize all results to multiple factors, we only discuss the one factor case. They pose 
spai^se PCA as an /i//2-regularized regression problem, specifically 

m 
{x*,y*) =argmin{V||5i - xy^bi\\l + X\\y\\l + s\\y\\i : \\x\\l = l,x,y G R"} (39) 

x,y _, 

where A and s are the I2 and /i regularization parameters, respectively, and B G r™x" js a data matrix with 
rows bi G R". When s = 0, they show that y* is proportional to the leading eigenvector of B^B. Indeed, 
when s = 0, problem (l39l ) can be recast as a classical maximum eigenvalue problem in x: 



X 



aigmax {x'^B'^B{B'^B + \In)~^B'^Bx : \\x\\l = l,xe R"} (40) 



by first solving for y (simple algebra shows y* = {B^B + XIn)^^B^Bx) and plugging y* into ( [39l ). It is 
easy to show that the x* that solves problem (l40l ) is equal to the leading eigenvector of B^B for all A > 0, 
and thus, for the purposes of finding the leading eigenvector, we do not need to regularize the matrix (i.e., 
set A = 0). 

Problem ( [39l ) uses an li penalty, known as a LASSO penalty, in order to induce sparsity on y resulting 
in an approximate sparse leading eigenvector y*/||y*||2- An alternating minimization scheme in x and y is 
proposed to solve problem ( [39l ). For fixed y, we have 



E 



/kh = Y, '■''''" - ^iy^l")''!'' + (n'^'iif ':'''■-) = -2/(E '"''?') 



where C is a constant (using the constraint ||a;||2 = 1), so that the minimizer x* is solved by maximizing 
a linear function over the unit sphere which, by Lemma|9l is easily solved in closed-form. For fixed x, the 
minimizer y* is found by solving an unconstrained minimization problem of the form || • l^ + s|| • ||i (also 
known as the elastic net problem). This problem can be solved efficiently for fixed s using fast first-order 
methods such as FISTA |2i| or for a full path of values for s using LARS f\M . Thus, PTl solve a nonconvex 
problem in two variables using alternating minimization. While this scheme is computationally inexpensive 
compared to convex relaxations, it is not as cheap as the schemes we are considering due to the subproblem 
with fixed x, and no convergence results have been derived for it. 

6 Experiments 

Thus far, various algorithms have been provided with the goal of learning sparse rank one approximations. In 
this section, these different methods are compared. The algorithms considered here are /q -constrained PCA 
(novel iteration), an approximate greedy algorithm ifTOl . GPowerLl (/i -penalized PCA of |fT9l ). GPowerLO 
(/o-penalized PCA of |[T9l ). Expectation-Maximization (Zi -constrained PCA of |[30l ). and thresholding (se- 
lect k entries of principal eigenvector with largest magnitudes). We also consider an exact greedy algorithm 
and the optimal solution (via exhaustive search) for small dimensions (77, = 10). 

The goal of these experiments is two-fold. Firstly, we demonstrate that the various algorithms give 
very similar performance. The measure of comparison used is the proportion of variance explained by a 
sparse vector versus that explained by the true principal eigenvector, i.e., the ratio x'^Ax/v'^Av where x is 
the sparse eigenvector and v is the true principal eigenvector of A. The second goal is to solve very large 
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sparse PCA problems. The largest dimension we approach is n = 50000, however, as discussed above, the 
ConGradU algorithm applied to /q -constrained PCA has very cheap 0{mn) iterations and is limited only 
by storage of a data matrix. Thus, on larger computers, extremely large-scale sparse PCA problems (much 
larger than those solved even here) are also feasible. 

Note that we do not compare against all algorithms listed in Table [T] In particular, SPCA 11411 was 
already demonstrated to be computationally more expensive, as well to provide inferior performance, to 
GPowerLl and GPowerLO [19]. The Zi -constrained PCA method of f36| and Zo-penalized PCA method 
of IIBTI are also cheap methods that give similar performance (learned from experiments not shown in this 
paper) to the algorithms in our experiments. Finally, note that, for all experiments, we do a postprocessing 
step in which we compute the largest eigenvector of the data matrix in the fc-dimensional subspace that is 
discovered by the respective methods. 

All experiments were performed in MATLAB on a PC with 2.40GHz processor with 3GB RAM. Codes 
from the competing methods were downloaded from URL's available in the corresponding references. Slight 
modifications were made to do singular value decompositions rather than eigenvalue decompositions in 
order to deal with much smaller m x n data matrices rather than n x n covariance matrices. We first 
demonstrate performance on random matrices and follow with a text data example. 

6.1 Random Data 

We here consider random data matrices F G jjmxn ^^^ p_ ^ N{0, 1/m). It was already shown in 
literature on greedy methods ifTOl and convex relaxations |[TTll22l that random matrices of the form xx'^ + U 
where U is uniformly distributed noise are easy examples. Results here show that taking sparse eigenvectors 
of the matrix F^F is also relatively easy. 

The experiments consider n = 10 (m = 6) and n = 5000, 10000, 50000 (each with m = 150), 
each using 100 simulations. We consider Zq -constrained PCA with k = 2,... ,9 for n = 10 and k = 
5, 10, ... , 250 for the remaining tests. The optimal solution (found by exhaustive search) and the exact 
greedy algorithm (too computationally expensive for high dimensions) are only used when n = 10. 

Figure |2] compares performance of the various algorithms. Similar patterns are seen as n increases. For 
n = 10, optimal performance is obtained for almost every algorithm. For higher dimensions, we have no 
measure of the gap to optimality. As the dimension increases, the proportion of explained variation by using 
the same fixed cardinality decreases as expected. The next subsection shows that we do not necessarily need 
to explain most of the variation in the true eigenvector in order to gain interpretable factors. 

Results only up to a cardinality level of 250 variables are displayed because our goal is simply to com- 
pare the different algorithms. All algorithms, except for simple thresholding, perform very similarly. These 
figures do not sufficiently display the story, so we describe the similar pattern that occurs. The approximate 
greedy algorithm does best for smallest cardinalities, then the expectation-maximization scheme dominates, 
and at some point the novel Zq -constrained PCA scheme gives best performance at a higher level of explained 
variation. For n = 50000, we do not actually see this change yet because such little variation is explained 
with only 250 variables. Furthermore, it is important to notice that the thresholded solution is consistently 
and greatly outperformed by all other methods, suggesting that the performance results ai^e enhanced via 
the conditional gradient algorithm. These experiments simply show that these algorithms offer very similar 
performance, and hence we next compare them computationally. 

Figure[3]displays the computational time comparing the various algorithms for the different dimensions. 
Firstly, note that for penalized PCA problems (GPowerLO and GPowerLl), a parameter must be tuned 
in order to achieve the desired sparsity. Figures here do not account for time spent tuning parameters. 
Secondly, the greedy algorithms must compute the greedy solution at all sparsity levels of 1 , ... , 250 in 
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Figure 2: Plots show the average percent of variance explained by the sparse eigenvectors found by 
several algorithms, i.e., the ratio x^ {F'^ F)x / v^ {F^ F)v where x is the sparse eigenvector and v is 
the true first eigenvector F is an 777 x 77 matrix with Fij ^ N{0, 1/m). Sparsities of 2, . . . , 9 are 
computed for 77 = 10 and 5, 10, ... , 250 for the remaining experiments. 100 simulations are used to 
produce all results. 



order to obtain the solution with 250 variables, and hence the time displayed to compute greedy solutions is 
the cumulative time. Time for each algorithm is thus also taken as the cumulative time, albeit for all others 
it is the cumulative time to obtain a solution with sparsity levels of 5, 10, ... , 250. The novel /q -constrained 
algorithm requires an initial solution which we take as thresholded solution of the true principal eigenvector. 
The time to obtain that initial solution is marked as svdTime, however it need only be computed once for 
the entire path. 

For n = 10, the exact greedy algorithm is clearly the most expensive, requiring n maximum eigenvalue 
computations per iteration. For higher dimensions, the same pattern occurs. The expectation-maximization 
scheme requires the most time because the scheme implicitly solves a penalized problem and thus also 
implicitly tunes a parameter. GPowerLl is surprisingly (since the tuning time is not included) next. Despite 
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Figure 3: Plots show the average cumulative time to produce the sparse eigenvectors of F^F found 
by several algorithms, with F anm x n matrix with Fij ~ N{Q, l/m). Sparsities of 2, . . . , 9 are 
computed for 77 = 10 and 5, 10, ... , 250 for the remaining experiments. Note that cumulative times 
are given, i.e., the time to calculate the vector with 30 nonzeros adds up the time to compute vectors 
with 5, 10, . . . , 30 nonzeros, in order to compare with the approximate greedy method. The svdTime 
is the time required to compute the principal eigenvector of F^F which is used to compute an initial 
solution for ?o-constrained PCA. 100 simulations are used to produce all results. 



being cheap, it requires more iterations than other methods to converge. The approximate greedy algorithm 
follows, and is expected to be (relatively) computationally expensive because of the maximum eigenvalue 
computed at each iteration. This is followed by GPowerLO and finally by the cheapest scheme, the novel 
/o-constrained PCA iteration. 

We now discuss the advantages and disadvantages of the different schemes. Clearly, if the sparsity is 
known (or the sparsities desired is much less than the full path), the approximate greedy algorithm is much 
more computationally expensive. Comparing the other cheap schemes, the /o-constrained PCA scheme is 
cheapest (given the initial solution). The disadvantage of the penahzed schemes (GPowerLl and GPowerLO) 
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is that they must be tuned which is computationally very expensive (not shown). Warmstaiting could be 
used, for example, by initializing for A; = 10 based on the solution to k = 5 rather than from the thesholded 
solution. Thus, if the desired sparsity is known, the /q -constrained PCA scheme is clearly the algorithm to 
use. If not, then all of the algorithms are cheap, offer similar- performance, and can be used to derive a path 
of sparse solutions. 

6.2 Republicans or Democrats: What is the Difference? 

We consider here text data based on all State of the Union addresses from 1790-2011. Transcripts are 
available at http://stateoftheunion.onetwothree.net where other interesting analyses of this data are also done. 
Here, sparse PCA is used to further analyze these historical speeches. Questions one might ask relate to 
how the language in speeches has changed from George Washington through Barak Obama or how the 
relevant issues divide the different presidents. After taking the stems of words and removing commonly 
used stopwords, we created a bag-of-words data set based on all remaining words, leaving 12953 words 
(i.e., n = 12953 for this example). Our data matrix here is i? G jjmx" where m is the number of speeches 
and Bij is the number of times the j*'^ word occurs in the i^^ State of the Union address. We analyze two 
different sample sizes: using all speeches from 1790-2011 (m = 225) and only speeches from 1982-2011 
(m = 31). The rows of B are normalized so that each speech is of the same length. The following results 
are for PCA and sparse PCA performed on the covariance matrix A = B^B. 

Figure m displays the performance of the various algorithms on the text data set using the same measure 
as with random data above. For this high-dimensional data set, much fewer variables are needed to explain 
more variation relative to what was observed with the random matrices. Another major difference is that 
each algorithm gives exactly the same solution; even just taking the thresholded solution gives the same 
solution. This leads to postulate that real data often might contain a structure that makes it rather easy to 
solve (still, of course, we have no results on solution quality). Note that the /q -constrained scheme seems 
to be the cheapest algorithm here because, starting at the thresholded solution, it required one iteration to 
achieve convergence! Despite the simplicity of obtaining sparse solutions for this data, we continue to show 
what can be learned using this tool. The goal is to show that sparse factors offer interpretability that cannot 
be learned from using all 12953 variables. 

Figure[5](left) shows the result of projecting the data on the first two principal eigenvectoro The second 
factor cleai^ly clusters the speeches into two groups (roughly into positive versus negative coordinates in the 
second factor). Examination of these two groups shows a chronological pattern which as seen in the figure 
clusters those speeches that occurred before (and during) World War I with those speeches that were made 
after the war. Figure [5] (right) shows the data projected on sparse factors giving very a similar illustration. 
Factors 1 and 2 use 150 and 15 variables, respectively. Table |2] displays the words from the spai^se second 
factor and their sign. We roughly associate the positively weighted words with speeches before the wai^ and 
negatively weighted words with speeches after the war. Then one might interpret that, before World War I, 
presidents spoke about the United States of America as a collection of united states, but afterwords, spoke 
about the country as one american nation that faced issues as a whole. 

Figure|6]next shows a similar analysis using only speeches from 1982-201 1. A clear distinction between 
republicans and democrats was discovered. Again, sparse PCA is used to interpret the factors with 15 
variables each. Table [3] shows the most important words discovered by PCA (using thresholding) and sparse 
PCA for the top 3 factors. The first factor gives the same words for both analyses and no clear interpretation 
is seen. The second factors are more interesting. The thr^esholded PCA factor clearly relates to international 



We use projection deflation, as described in 1231 . to obtain multiple sparse factors. 
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Figure 4: The left plots shows the average percent of variance explained by the sparse eigenvectors 
and the right plots show the computational time to run the algorithms. Data is from the text of State 
of the Union addresses. Two different numbers of samples are used: all addresses from 1790-201 1 
and just those from 1982-2011. 



security issues. The sparse PCA factor, however, clearly focuses on domestic issues, e.g., health-care and 
education. Differences occur because PCA deflates with the true eigenvectors while sparse PCA deflates 
with the sparse factors. In any case, these are clearly issues that divide republicans and democrats. The third 
thresholded PCA factor seems to be related to educational reforms and funding them. The third sparse PCA 
factor is clearly related to fiscal policies and the economy. 

7 Concluding Remarks 

Sparse PCA admits a simple formulation which maximizes a (usually convex) quadratic objective subject 
to seemingly simple constraints. Nonetheless, it is a difficult nonconvex optimization problem, and a large 
literature has focused on various modifications of the desired problem in order to derive simple algorithms 
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Figure 5: The left plot shows the text of all State of the Union addresses from 1790-201 1 reduced 
from 12953 to 2 dimensions using PCA. The right plot shows the same data reduced to 2 dimensions 
using ?o-constrained PCA. Factor 1 is a function of 150 words, while Factor 2 is a function of only 
15 words. 
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Table 2: The table shows the top 15 our of 12953 words associated with the second factor derived 
from Zo -constrained PCA on the text of all State of the Union addresses from 1790-201 1 . The words 
for Factor 2 of thresholded PCA are almost identical. A first factor with 150 nonzero entries was 
deflated from the data. The signs of the words are given to show what drives the difference between 
the clusters in Figure|5] Note that words here are word stems, i.e. the words program and its plural 
programs are both represented by program. 
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Figure 6: The left plot shows the text of 31 State of the Union addresses from 1982-201 1 reduced 
from 12953 to 2 dimensions using PCA. The right plot shows the same data reduced to 2 dimensions 
using /o-constrained PCA where both factors are functions of exactly 15 words. 
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Sparse PCA Factors 1-3 
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Table 3: The left side of the table shows the top 15 words associated with the first 3 factors derived 
from PCA on the text of 31 State of the Union addresses from 1982-2011. The right side of the 
table shows the top 15 words when the 3 factors are derived using ^o -constrained PCA. The original 
number of dimensions is 12953. Note that words here are word stems, i.e. the words program and 
its plural programs are both represented by program. 



that hopefully produce good approximate solutions. No gaps to the solution of the original problem are 
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given by any of the cun^ently known schemes. 

In this paper, we have shown that the conditional gradient algorithm ConGradU: 

• can be directly applied to the original /q -constrained PCA problem Q without any modifications to 
produce a very simple scheme with low computational complexity. 

• can be successfully applied to maximizing a convex function over an arbitrary compact set and was 
proven to exhibit global convergence to stationary points. Efficiency of this scheme builds on the 
result that, while maximizing a quadratic function over the I2 unit ball with an Iq constraint is a 
difficult problem, maximizing a linear function over the same nonconvex set is simple. 

• provides a unifying framework to derive and analyze all new and old schemes discussed in the paper 
which, as we have seen, were derived from disparate approaches in the cited literature. As shown, all 
these algorithms which have been used in various applications are special cases of ConGradU. 

The overall message is that, for some difficult problems, we can achieve the same (Umited) theoretical 
guarantees and practical performance using the same algorithm on modified (seemingly easier) problems as 
we can on the original difficult problem. Furthermore, all of these algorithms emerging from ConGradU 
give similar performance in practice and with similar complexities. 

We conclude by showing that the same tools we have used for applying the ConGradU algorithm to 
sparse PCA can readily be applied to other sparsity-constrained statistical tools, e.g., sparse Singular Value 
Decompositions, sparse Canonical Correlation Analysis, and sparse nonnegative PCA. Note that our com- 
ments below about Zq -constrained problems can also be extended to the corresponding /i -constrained and 
^0/^1-penalized problems. 

Sparse Singular Value Decomposition (SVD) 

Sparse SVD solves the problem 

max{x'^By : \\x\\2 = 1, ||y||2 = 1, \\x\\o < h, \\y\\o <k2,xG R"',y £ R"} (41) 

where B G r^x" j^ the data matrix and ki,k2 are positive integers. Note that this is equivalent to Iq- 
constrained PCA when x = y and S G S". ll36l considered a relaxation to this problem as described in 
Section [5!2l They relaxed the Iq constraints on x and y with li constraints (and relaxed equalities to inequal- 
ities) and applied an alternating maximization scheme, where each optimization problem is easily solved via 
Proposition [T5I Following exactly the approach we suggested for Zq -constrained PCA, there is no need to 
relax the Iq constraints. Proposition [TT] can be used to solve directly the alternating optimization problems, 
giving rise to a simple algorithm for the Zq -constrained SVD problem (|4TI ). 

Sparse Canonical Correlation Analysis (CCA) 
Sparse CCA solves the problem 

max{x^5^Cy : x^B'^Bx = l,y'^C^Cy = 1, ||a;||o < fci, ||y||o <k2,x£ RP,y € R"} (42) 

where B G R^xp and C S R'^^'J are data matrices and ki,k2 are positive integers. f3^ suggest that 
useful (i.e., interpretable) results can still be obtained by substituting the identity matrix for B^B and C'^C 
in the constraints, resulting in the sparse SVD problem above. Rather, we propose substituting the diago- 
nals of B^B and C^C as proxies. Propositions [TT] and [T5] (among others in Section H]) can both easily be 
extended to optimizing over the constraints x^Dx = 1 and x^Dx < 1, where D is diagonal. A simple 
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alternating maximization algorithm then follows for the resulting Zo-constrained approximate CCA problem. 

Sparse Nonnegative Principal Component Analysis (PCA) 

Spai^se nonnegative PCA solves the problem 

maxjx^^x : ||x||2 = 1, ||x||o < /c,2; > 0,x G R"} (43) 

where A € R"'^" is a covariance matrix and /c is a positive integer. This is exactly the /q -constrained PCA 
problem Q with additional nonnegativity constraints. Simple extensions of Propositions [TT] and [15] lead to 
a simple scheme for /o-constrained nonnegative PCA based on the ConGradU algorithm. 
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